DRAFT add Data Health Checker #1574
Draft
+189
−0
Description
First draft of a data health checker, as discussed in #854. The checker receives a path to the data in CSV or qlib format (not implemented yet), converts the data to a DataFrame, and performs basic checks for data completeness and correctness.
I am not too familiar with the qlib data handling yet, so I am hoping to get some early feedback on whether this is heading in the right direction.
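To make the kind of check described above concrete, here is a minimal sketch of two completeness checks on a DataFrame loaded from CSV. It is not the code in this PR: the function names `check_required_columns` and `check_missing_data`, the `REQUIRED_COLUMNS` list, and the sample data are all assumptions made for illustration.

```python
import io

import pandas as pd

# Assumed set of expected columns; the actual checker may require others.
REQUIRED_COLUMNS = ["open", "high", "low", "close", "volume"]


def check_required_columns(df):
    """Return the required columns that are absent from df (hypothetical helper)."""
    return [col for col in REQUIRED_COLUMNS if col not in df.columns]


def check_missing_data(df):
    """Return a {column: NaN count} dict for columns with at least one NaN."""
    counts = df.isna().sum()
    return counts[counts > 0].to_dict()


# Made-up sample mimicking the situation in #854: "volume" is named "vol".
csv_text = """date,open,high,low,close,vol
2020-01-01,1.0,1.2,0.9,1.1,100
2020-01-02,1.1,,1.0,1.2,120
"""
df = pd.read_csv(io.StringIO(csv_text))

missing_cols = check_required_columns(df)
missing_data = check_missing_data(df)
print("Missing required columns:", missing_cols)
print("Columns with missing values:", missing_data)
```

With data like this, the sketch would report the absent "volume" column by name instead of failing later with an unrelated error.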
Motivation and Context
See #854. There, a user got an unhelpful error message because their data did not adhere to the expected format (specifically, the "volume" column was named "vol"). When checking the data of #854 with this checker, the user would get:
Note: the large step change check uses two configurable thresholds (one for price and one for volume) and only looks for step changes in the OHLCV columns.
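One possible shape for such a step change check is sketched below, assuming that a "step" means the absolute relative change between consecutive rows. The function name `check_large_step_changes`, the parameter names, and the default threshold values are hypothetical and may differ from what this PR implements.

```python
import pandas as pd

PRICE_COLUMNS = ("open", "high", "low", "close")


def check_large_step_changes(df, price_threshold=0.5, volume_threshold=3.0):
    """Return {column: [row labels]} where the relative change exceeds the threshold.

    Assumption: thresholds are relative changes, e.g. 0.5 means a 50% move
    from one row to the next. Only OHLCV columns present in df are checked.
    """
    flagged = {}
    columns = [c for c in PRICE_COLUMNS + ("volume",) if c in df.columns]
    for col in columns:
        threshold = volume_threshold if col == "volume" else price_threshold
        # Absolute relative change versus the previous row; first row is NaN.
        rel_change = df[col].diff().abs() / df[col].shift().abs()
        hits = rel_change[rel_change > threshold]
        if not hits.empty:
            flagged[col] = hits.index.tolist()
    return flagged


# Made-up sample: close doubles at row 2, volume jumps fivefold at row 3.
df = pd.DataFrame(
    {
        "close": [10.0, 10.5, 21.0, 20.0],
        "volume": [100.0, 110.0, 120.0, 600.0],
    }
)
flagged = check_large_step_changes(df)
print(flagged)
```

Keeping separate thresholds for price and volume reflects that volume normally fluctuates far more than price, so a single shared threshold would either miss price jumps or flag ordinary volume swings.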
How Has This Been Tested?
No tests yet, as this is only a first draft.
pytest qlib/tests/test_all_pipeline.py under upper directory of qlib.
Screenshots of Test Results (if appropriate):
Types of changes